Data Replication and Consistency

Overview

Data replication is an integral aspect of the storage node architecture. This document describes how data replication is configured for the RHQ Storage Node. It also covers consistency as it relates to replication.

Replication

Cassandra uses consistent hashing and forms a token ring for partitioning data. In earlier versions of Cassandra (prior to 1.2), each node was assigned a single token and therefore owned a single partition on the ring. Virtual nodes, introduced in 1.2, allow a node to be assigned multiple tokens and therefore own multiple partitions. RHQ uses virtual nodes in large part because they greatly simplify the work of reassigning tokens and rebalancing the cluster after nodes are added or removed. As the cluster size changes, the replication factor (RF) of each keyspace is changed according to the following table (a sketch of the corresponding keyspace change follows it):

Number of nodes | system_auth RF | rhq RF
----------------|----------------|-------
1               | 1              | 1
2               | 2              | 2
3               | 2              | 2
>= 4            | 3              | 3
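
To make the table concrete, here is a minimal sketch of how a keyspace's RF would be raised when the cluster grows from one node to two, using the DataStax Java driver that RHQ uses for client access. The contact point and the use of SimpleStrategy are assumptions, and the actual statements RHQ issues internally may differ.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RaiseReplicationFactor {
    public static void main(String[] args) {
        // Assumed contact point; a real deployment would use a storage node address.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        try {
            // Per the table above, a two-node cluster uses RF 2 for both keyspaces.
            session.execute("ALTER KEYSPACE system_auth WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': 2}");
            session.execute("ALTER KEYSPACE rhq WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': 2}");
            // Changing the RF does not move existing data by itself; repair has
            // to run afterwards so the new replicas actually receive the rows.
        } finally {
            cluster.shutdown();
        }
    }
}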

The first thing to note is that the RF of neither system_auth nor rhq is user-configurable. This is intentional, for a couple of reasons. First, understanding the ramifications of changing the RF requires a certain depth of Cassandra knowledge, and the management of Cassandra in RHQ is predicated on users not having an in-depth understanding of Cassandra. Second, the RF impacts application code: it affects the performance of both reads and writes, and it determines the consistency guarantees of client read/write operations.

Consistency

Consistency requirements are configurable on a per-operation basis. RHQ uses consistency level (CL) ONE for all read/write operations. This means that with multiple nodes we cannot guarantee strong consistency for reads, which is in stark contrast to the relational database. It is not as bad as it sounds, however. In the absence of failures, replicas (nodes that share the same token(s)) should be consistent. Furthermore, when a client query is executed, Cassandra does read repair in the background to ensure that all replicas have the latest known values. RHQ also runs anti-entropy repair as part of regularly scheduled storage cluster maintenance to ensure that replicas are consistent.
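
For reference, this is roughly how a per-operation consistency level looks with the DataStax Java driver; the table and query here are illustrative placeholders, not RHQ's actual statements.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ReadAtClOne {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("rhq");
        // The consistency level is attached to the statement, not the keyspace.
        // At ONE the coordinator returns as soon as a single replica responds.
        SimpleStatement query = new SimpleStatement(
                "SELECT time, value FROM raw_metrics WHERE schedule_id = 14837");
        query.setConsistencyLevel(ConsistencyLevel.ONE);
        System.out.println(session.execute(query).all());
        cluster.shutdown();
    }
}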

The consistency level with which writes are executed has no impact on alerting. Alert notifications are generated from live data received from the agent.

The following sections discuss the implications of each RF.
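
As a rule of thumb for what follows: at CL ONE an operation on a given partition succeeds as long as at least one of that partition's RF replicas is alive. A toy model (not RHQ code) that the per-RF sections below all reduce to:

public class ClOneAvailability {
    // At CL ONE the coordinator needs at least one live replica of the
    // partition being read or written.
    static boolean canServe(int replicationFactor, int downReplicas) {
        return replicationFactor - downReplicas >= 1;
    }

    public static void main(String[] args) {
        System.out.println(canServe(2, 1)); // true:  RF 2, one replica down
        System.out.println(canServe(2, 2)); // false: RF 2, both replicas down -> UnavailableException
        System.out.println(canServe(3, 2)); // true:  RF 3, two replicas down
    }
}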

One Node

With a single node there is no data replication, and all reads guarantee strong consistency. In other words, a query will return the latest value(s) for the columns/rows being queried.

Two Nodes

All writes go to both nodes. Inconsistent reads are possible; however, all reads and writes can be served with a node down.

Going from one to two nodes will not decrease the size of data on disk! With an RF of 2, each of the two nodes still stores a complete copy of the data.

Three Nodes

The RF remains 2, which means metric data for a given schedule ID will be stored on two of the three nodes (since the schedule ID is the partition key). Inconsistent reads are possible since both reads and writes are done at CL ONE. The footprint on disk of each node will be reduced since each node no longer owns all partition keys.
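
To see why availability is decided per partition, consider how a raw metric would be written. The schema below is a simplified assumption modeled on the log excerpt further down (schedule_id as the partition key); the partition key alone determines which two of the three nodes hold replicas of the row.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.Date;

public class WriteRawMetric {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("rhq");
        // schedule_id is hashed to pick the replicas; with RF 2 this row
        // lands on two of the three nodes.
        PreparedStatement insert = session.prepare(
                "INSERT INTO raw_metrics (schedule_id, time, value) VALUES (?, ?, ?)");
        BoundStatement bound = insert.bind(14837, new Date(), 95067.0);
        bound.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(bound);
        cluster.shutdown();
    }
}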

If one of the nodes goes down, we can continue performing both reads and writes without any problem since one of the replicas is still up.

If two nodes are down, then some reads and writes will fail, but not necessarily all of them. Even though there is still a live, healthy node, both of the replicas for the partition key being read or written may be down. In that case we get an UnavailableException on the client side. Authentication and authorization checks are also performed on requests at CL ONE. Remember that the RF of the system_auth keyspace is 2; consequently, a write request, for example, could fail during the authorization check without the write ever being attempted. Each node stores permissions in a local cache, and a query for the user is performed on a cache miss. That query can fail if both replicas are down. Unfortunately, the UnavailableException gets wrapped. In rhq-storage.log we will see,

ERROR [Native-Transport-Requests:551] 2013-09-09 16:30:01,338 ErrorMessage.java (line 210) Unexpected exception during request
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
	at org.apache.cassandra.service.ClientState.authorize(ClientState.java:290)
	at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:170)
	at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:163)
	at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:147)
	at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:67)
	at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:100)
	at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:223)
	at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:121)
	at org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287)
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
	at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:247)
	at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84)
	at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50)
	at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68)
	at org.apache.cassandra.service.ClientState$1.load(ClientState.java:276)
	at org.apache.cassandra.service.ClientState$1.load(ClientState.java:273)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
	... 20 more
Caused by: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
	at org.apache.cassandra.db.ConsistencyLevel.assureSufficientLiveNodes(ConsistencyLevel.java:250)
	at org.apache.cassandra.service.ReadCallback.assureSufficientLiveNodes(ReadCallback.java:161)
	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:870)
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:805)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:133)
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:236)
	... 29 more

but on the client side (in the RHQ server log) we will see something like,

16:35:01,821 ERROR [org.rhq.server.metrics.MetricsServer] (New I/O worker #24) An error occurred while
inserting raw data MeasurementDataNumeric[name=metric004, value=95067.0, scheduleId=14837, timestamp=1378758886941]:
com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error occured server side:
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
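
A sketch of what distinguishing these two failure modes might look like on the client side; this is illustrative, not the actual MetricsServer error handling:

import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.DriverInternalError;
import com.datastax.driver.core.exceptions.UnavailableException;

public class StorageErrorHandling {
    static void executeWithDiagnostics(Session session, String cql) {
        try {
            session.execute(cql);
        } catch (UnavailableException e) {
            // The coordinator rejected the statement itself: too few of the
            // partition's replicas are alive (at CL ONE it needs just one).
            System.err.printf("%d of %d required replicas alive%n",
                    e.getAliveReplicas(), e.getRequiredReplicas());
        } catch (DriverInternalError e) {
            // The auth-check failure above surfaces this way instead: the
            // UnavailableException was thrown inside Cassandra's permission
            // lookup and reaches the client wrapped as an unexpected
            // server-side error.
            System.err.println("Unexpected server-side error: " + e.getMessage());
        }
    }
}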

Four Nodes

The RF increases to 3 for both the system_auth and rhq keyspaces. Metric data for a given schedule ID will be stored on three nodes. The footprint on disk for each node will not decrease, since we are also increasing the number of replicas. Inconsistent reads are possible since both reads and writes are done at CL ONE.

We can perform all reads and writes with two nodes down. With three nodes down, though, reads and writes will start to fail as previously described.

Five or More Nodes

The RF stays at 3 for both the system_auth and rhq keyspaces. The footprint on disk for each node will decrease. Inconsistent reads are possible since both reads and writes are done at CL ONE.
